In this page, I will show you how I used publicly available data from the NHS to graph a time series of dementia diagnosis rates in England. This project has been developed for the PSY6422 Data Management and Visualisation module, which is part of the MSc in Psychological Research Methods with Data Science at the University of Sheffield. Please visit my github repository to find all documentation related to this project.
I found a public data repository of dementia diagnoses managed by the NHS. Data from this repository have been used to develop a Dementia Publication Dashboard, which provides an insight into the prevalence of dementia in England.
Not every person suffering from dementia has a formal diagnosis. The national goal (i.e., benchmark rate) is to ensure that at least 2/3 (≥ 66.7%) of those suffering from dementia have a formal diagnosis. The NHS calculates dementia diagnosis rates by comparing the number of recorded diagnoses against the number of people estimated to have dementia. Diagnosis rates are calculated for people aged 65 and above.
As I was playing around with the dashboard, I noticed that there were no comparisons of regional vs national dementia diagnosis rates. I thus saw the perfect opportunity to ask the following questions:
1) How have regional rates (compared to national rates) changed overtime?
2) Are there regions that have consistently been above or below the national diagnosis rates and the benchmark rate?
For those unfamiliar with English geography, here’s a picture depicting the 9 statistical regions of England:
Retrieved from: https://www.staffordministries.org/blog/2018/4/18/geography-101
This project was developed using RMarkdown. I used packages from these libraries:
library(tidyverse)
library(gghighlight)
library(rmarkdown)
library(purrr)
library(lubridate)
library(gganimate)
library(RColorBrewer)
library(jcolors)
library(plotly)
library(plyr)
#library(highcharter) am I including highcarts
Step 1: Get the data in
I used data from May 2016 until April 2021. Data were provided the NHS in separate files containing a year’s worth of data. I uploaded 5 raw data sets:
df_2017 <- read.csv("https://raw.githubusercontent.com/jsgm21/gonzalezmartinez_psy6422_project/main/raw_data/dem-diag-ind-phe-apr-2017.csv") #data period start: May 2016
df_2018 <- read.csv("https://raw.githubusercontent.com/jsgm21/gonzalezmartinez_psy6422_project/main/raw_data/dem-diag-ind-phe-Apr-2018.csv")
df_2019 <- read.csv("https://raw.githubusercontent.com/jsgm21/gonzalezmartinez_psy6422_project/main/raw_data/dem-diag-ind-phe-Apr-2019.csv")
df_2020 <- read.csv("https://raw.githubusercontent.com/jsgm21/gonzalezmartinez_psy6422_project/main/raw_data/dem-diag-ind-phe-Apr-2020.csv")
df_2021 <- read.csv("https://raw.githubusercontent.com/jsgm21/gonzalezmartinez_psy6422_project/main/raw_data/dem-diag-ind-phe-Apr-2021.csv") #data period end: April 2021
Step 2: Check the data
Let’s have a look at the variables in the data set and a random sample of observations. The code below is representative of all data sets because they have been structured identically by the NHS.
names(df_2017) #returns the names of the variables in a data frame
## [1] "INDICATOR" "ORG_TYPE" "ORG_CODE" "NAME" "ACH_DATE" "MEASURE"
## [7] "VALUE" "DQ"
paged_table( #use paged_Table to make table output neat for publishing
sample_n( #returns a random sample of 10 observations
df_2017, 10))
A couple of challenges were clear:
Challenge 1: Dates were not in the correct format (YYYY/MM/DD) and class type (i.e., date).
Challenge 2: The variable “MEASURE” needed to be split into separate variables. The NHS collected several dementia measures but I was only interested in visualising the measure “DIAG_RATE_65_PLUS”
Challenge 3: Organisation types (ORG_TYPE) and codes (ORG_CODE) represent a mix of local, regional, and national entities. I needed to extract England and its regions. Overcoming this challenge involved doing research and using resources from the Office for National Statistics to determine which regional tags were relevant for this project.
Challenge 4: Develop code that would address challenges 1-3 simultaneously for all raw data sets. I could not merge the data frames after upload because they exceeded the memory capacity of my computer.
Solution:
class(df_2017$ACH_DATE) #spot the problem: class type is a character rather than a date
## [1] "character"
#create a list of the data frames so changes are applied to all of them
list_dfs <- list(df_2017, df_2018, df_2019, df_2020, df_2021)
#use map_dfr and pipes to create the processed data frame. I chose this method because all data sets were structured identically - thank you, NHS!
df_proc <- map_dfr(
list_dfs, ~{.x %>% #create a new date variable (DATE) by converting the old dates into the correct format
mutate(
DATE = as.Date(ACH_DATE, format = "%d%B%Y")) %>% #split the values of the MEASURE variable so each measure becomes its own variable
pivot_wider(
names_from = MEASURE,
values_from = VALUE) %>% #filter data using the ORG_TYPE codes that correspond to England and its regions
filter(
ORG_TYPE == "PHE_CENTRE" | ORG_TYPE == "COUNTRY_GEOGRAPHICAL")})
class(df_proc$DATE) #verify that class type is indeed a date
## [1] "Date"
paged_table(
sample_n(df_proc, 10)) #returns a random sample of the processed data frame in a neat output
Visual diagnostic
I developed a basic, diagnostic graph to get an insight of whether I needed to do more adjustments to the processed data.
ggplot(
data = df_proc,
mapping = aes(x = DATE,
y = DIAG_RATE_65_PLUS,
colour = NAME)) + #use colour = NAME so the regions get different colours
#customise y axis, lower and upper axis limit
ylim(
c(50, 80))+
#use geom_line to show regions as lines
geom_line(
size = 1) +
#use geom_hline to set the benchmark rate, i.e., 66.7
geom_hline(
yintercept = 66.7,
linetype = 1, #1 for a solid line
size = 1,
colour = "dark grey") +
#use the Paired palette colour from the RColorBrewer package
scale_color_brewer(
palette = "Paired") +
#customise theme to plain black and white
theme_bw() +
#assign graph labels
labs(
x = "Year", #x axis label
y = "Dementia diagnosis rate (%)", #y axis label
title = "Dementia diagnosis rates in England", #assigns chart tittle
caption = "Data Source: NHS") #assigns a caption
There were multiple issues with this graph, but the most salient one was that Yorkshire and the Humber was not shown consistently in the graph. This was likely due to a mistake in coding the raw data.
I standardised Yorkshire and the Humber and other regional names:
#correct "yorkshire and humber" to "yorkshire and the humber"
df_proc$NAME [df_proc$NAME == "YORKSHIRE AND HUMBER"] <- "YORKSHIRE AND THE HUMBER"
#use revalue to rename 'NAME' regions so they are no longer in all capitals
df_proc$NAME <- revalue(
df_proc$NAME, #"old value" first, then "new value"
c("ENGLAND" = "England",
"LONDON" = "London",
"WEST MIDLANDS" = "West Midlands",
"NORTH EAST" = "North East",
"YORKSHIRE AND THE HUMBER" = "Yorkshire and the Humber",
"EAST MIDLANDS" = "East Midlands",
"EAST OF ENGLAND" = "East of England",
"NORTH WEST" = "North West",
"SOUTH EAST" = "South East",
"SOUTH WEST" = "South West"))
#use 'data.frame' and 'complete' to ensure that the region and date combinations are stable. It's a small line, but it is essential.
df_proc <- data.frame(
complete(
df_proc, DATE, NAME))
Then I checked that the naming issues have been solved. I also made a few tweaks to the original diagnositc plot to make it more visually appealing.
#I used a structure similar to the previous graph. I will only comment code to highlight differences.
ggplot(
data = df_proc,
mapping = aes(x = DATE,
y = DIAG_RATE_65_PLUS,
colour = NAME)) +
ylim(
c(50, 80)) +
#introduce and customise x axis limits by the start and end date of data collection
scale_x_date(
limits = as.Date(
c("2016-05-31","2021-04-30")), #use 6 month breaks so the x axis values show intervals of 6 months
date_breaks = "6 month",
date_labels = "%b %Y",
expand = c(0.02,0)) + #use expand to control the white space between the axes and the data
geom_line(
size = 1) +
geom_hline(
yintercept = 66.7,
linetype = 2, #2 for a dashed line
size = 1, #change line colour to black
colour = "black") +
annotate( #label parameters for the benchmark rate line
"text",
x = as.Date("2016-10-31"), #controls where 'Benchmark' label appears in the graph
66.7, #y-value where the label is referenced
vjust = -0.5, #label positioning just above the line
label = "Benchmark rate", #name of the Benchmark label
size = 4) + #controls label size
#change the colour palette to "Spectral" for better contrast
scale_color_brewer(
palette = "Spectral") +
theme_bw() +
theme( #customise grid lines
panel.grid.minor = element_blank(), #no minor grid lines for any axis
panel.grid.major.y = element_blank(), #no major y-grid lines so we're only left with major x-grid lines which are helpful for reading the chart
#change margin size of the graph, value order: top, right, bottom, left
plot.margin = margin( #set parameters and units to mm
1,1,1,1,"mm")) +
labs(
x = "Month",
y = "Dementia diagnosis rate (%)",
title = "Dementia diagnosis rates in England",
subtitle = "From May 2016 to April 2021", #introduce a subtitle to make graph more informative
caption = "Data Source: NHS")
Developing an engaging plot
The plot shown above could answer my research questions, but it’s a boring graph, it’s hard to read, and it does not use the potential of R as a graphic tool.
I developed a series of interactive graphs with the objective to create a visualisation where trends were easily understandable and comparisons between regions could be easily made.
Attempt 1
I developed this animated plot with the intention to visualise better the progress of national and regional trends
#I used a structure similar to the previous graph. I will only comment code to highlight differences.
#save graph structure into an object that will be used later for animation
gr_int0 <- ggplot(
data = df_proc,
mapping = aes(
x = DATE,
y = DIAG_RATE_65_PLUS,
colour = NAME)) +
ylim(
c(50, 80)) +
scale_x_date(
limits = as.Date(
c("2016-05-31", "2021-11-30")), #change upper limit to a date beyond data collection to give space to animated labels
date_breaks = "9 month", #change data breaks from 6 to 9 month intervals to facilitate reading
date_labels = "%b %Y",
expand = c(.02,0)) +
geom_point() + #introduce points for the animation
geom_line(
size = 1) +
geom_hline(
yintercept = 66.7,
linetype = 2,
size = 1,
colour = "black") +
annotate(
"text",
x = as.Date("2016-10-31"),
66.7,
vjust = -0.5,
label = "Benchmark rate",
size = 4) + #introduce shadowtext
shadowtext::geom_shadowtext( #geom_shadowtext takes NAME values (i.e., regions) and places them next to each region line in the animation. You will see soon!
data = df_proc,
mapping = aes(
x = DATE,
y = DIAG_RATE_65_PLUS,
label = NAME), #animated labels will take NAME values
hjust=-0.1, #makes region names show just next to the line
vjust = 0, #makes region names show just next to the line
bg.color = "white") + #bg.color as white because there is no need for a background colour for our region names
scale_color_jcolors(
palette = "pal8") + #important change, introduce new colour package -jcolors- and new colour palette -pal8- to make animation more readable
theme_bw() +
theme(
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
legend.position = "none", #remove legend since we'll have animated features from shadowtext
plot.margin = margin(1,1,1,1,"mm")) +
labs(
x = "Month",
y = "Dementia diagnosis rate (%)",
title = "Dementia diagnosis rates in England",
subtitle = "From May 2016 to April 2021",
caption = "Data Source: NHS") +
transition_reveal(DATE) #activate animate display of time series
#animation parameters
animate(
gr_int0, #object used to save the graph structure
fps = 7, #control speed of animation
end_pause = 35) #control immediate repetition of animation, so viewers have time to digest trends
I found the visualisation somewhat overwhelming, so I experimented with the function facet_wrap, which allows to split the animated graph.
Before using facet_wrap, I coded a “dummy” variable in the processed data set to control the order of the facets. I wanted to show England in the bottom part of the new graph and the regions on top.
#make a dummy variable for face_wrap later, future 'you' will thank me later
df_proc <- df_proc %>% #make new variable by revaluing ORG_TYPE values
mutate(
FACET_ORDER = revalue(ORG_TYPE,
c("PHE_CENTRE" = 1,
"COUNTRY_GEOGRAPHICAL" = 2))) #assign values 1 and 2 to control order
Then I implemented facet_wrap to get the graph below
#I used a structure similar to the previous graph. I will only comment code to highlight differences.
gr_int1 <- ggplot(
data = df_proc,
mapping = aes(x = DATE,
y = DIAG_RATE_65_PLUS,
colour = NAME)) +
ylim(
c(55, 80)) +
scale_x_date(
limits = as.Date(
c("2016-05-31","2021-11-30")),
date_breaks = "6 month", #change date breaks back to 6 month intervals for reading ease
date_labels = "%b %Y",
expand = c(.02,0)) +
geom_point() +
geom_line(
size = 1) +
geom_hline(
yintercept = 66.7,
linetype = 2,
size = 1,
colour = "black") +
annotate(
"text",
x = as.Date("2016-10-31"),
66.7,
vjust = -0.5,
label = "Benchmark rate",
size = 4) +
shadowtext::geom_shadowtext(
data = df_proc,
mapping = aes (x = DATE,
y = DIAG_RATE_65_PLUS,
label = NAME),
hjust=-0.1,
vjust = 0,
bg.color = "white") +
scale_color_jcolors(
palette = "pal8") +
facet_wrap( #introduce facets
~FACET_ORDER, #facets will be split based on order from "dummy" variable
ncol = 1, #number of columns is 1 so we have top and bottom subgrpahs shown individually
labeller = labeller(
FACET_ORDER = #labeller lets us name the subgraphs using values from the "dummy" order variable
c("1" = "Regional Trends",
"2" = "National Trend"))) +
theme_bw() +
theme(
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
legend.position = "none",
plot.margin = margin(1,1,1,1,"mm")) +
labs(
x = "Month",
y = "Dementia diagnosis rate (%)",
title = "Dementia diagnosis rates in England",
subtitle = "From May 2016 to April 2021",
caption = "Data Source: NHS") +
transition_reveal(DATE)
animate(
gr_int1,
fps = 7,
end_pause = 35)
I also experimented with individual facets, breaking down each geography into individual graphs. I used the variable NAME rather than FACET_ORDER to control the splitting of graphs:
#I used a structure similar to the previous graph. I will only comment code to highlight differences.
gr_int2 <- ggplot(
data = df_proc,
mapping = aes(
x = DATE,
y = DIAG_RATE_65_PLUS,
colour = NAME)) +
ylim(
c(55, 80)) +
scale_x_date(
limits = as.Date(
c("2016-05-31","2021-04-30")),
date_breaks = "12 month", #change date break to 12 month intervals for reading ease
date_labels = "%b %Y",
expand = c(.02,0)) +
geom_point() +
geom_line(
size = 1) +
geom_hline(
yintercept = 66.7,
linetype = 2,
size = 1,
colour = "black") +
annotate(
"text",
x = as.Date("2017-05-31"),
66.7,
vjust = -0.5,
label = "Benchmark rate",
size = 4) +
shadowtext::geom_shadowtext(
data = df_proc,
mapping = aes (x = DATE,
y = DIAG_RATE_65_PLUS,
label = NAME),
hjust=-0.1,
vjust = 0,
bg.color = "white") +
scale_color_jcolors(
palette = "pal8") +
facet_wrap(
~NAME, #facets will now be determined by the NAME varaible so will get 10 facets
ncol = 2) + #set number of columns to 5 pairs of side-by-side facets
theme_bw() +
theme(
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
legend.position = "none",
plot.margin = margin(1,1,1,1,"mm")) +
labs(
x = "Month",
y = "Dementia diagnosis rate (%)",
title = "Dementia diagnosis rates in England",
subtitle = "From May 2016 to April 2021",
caption = "Data Source: NHS") +
transition_reveal(DATE)
animate(
gr_int2,
fps = 7,
end_pause = 35)
Attempt 2
Whilst I appreciated the motion provided by the animated time series, I reckoned that a more effective visualisation would be less overwhelming and would allow to select different regions to facilitate comparisons
#I will fully comment this code to avoid confusion from graph structure changes
#save grapgh structure into an object for animation
gr_int3 <- ggplot(
data = df_proc,
mapping = aes(
x = DATE,
y = DIAG_RATE_65_PLUS,
colour = NAME)) + #use colour = NAME so the regions get different colours
ylim( #customise y axis, lower and upper axis limit
c(50, 80)) +
scale_x_date(
limits = as.Date( #introduce and customise x axis limits by the start and end date of data collection
c("2016-05-31","2021-04-30")),
date_breaks = "6 month", #use date breaks of 6 month intervals for reading ease
date_labels = "%b %Y",
expand = c(0.02,0)) + #use expand to control the white space between the axes and the data
geom_line( #use geom_line to show regions as lines
size = 1) +
geom_hline( #use geom_hline to set the benchmark rate, i.e., 66.7
yintercept = 66.7,
linetype = 2, #2 for a dashed line
size = 1,
colour = "black") + #line colour will be black
annotate( #label parameters for the benchmark rate line
"text",
x = as.Date("2016-12-31"), #controls where 'Benchmark' label appears in the graph
66.7, #y-value where the label is referenced
vjust = -0.5, #label positioning just above the line
label = "Benchmark rate", #name of the Benchmark label
size = 4,) + #controls label size
scale_color_brewer( #back to RColorBrewer and Spectral palette for aesthetics
palette = "Spectral") +
theme_bw() + #customise theme to plain black and white
theme( #customise grid lines
panel.grid.major.y = element_blank(), #no minor grid lines for any axis
panel.grid.minor.y = element_blank(), #no major y-grid lines so we're only left with major x-grid lines which are helpful for reading the chart
#change margin size of the graph, value order: top, right, bottom, left
plot.margin = margin( #set parameters and units to mm
1,1,1,1,"mm")) + #assign labels
labs(
x = "Month", #x axis label
y = "Dementia diagnosis rate (%)", #y axis label
title = "Dementia diagnosis rates in England - May 2016 to April 2021") #assigns chart tittle
#animation
ggplotly() %>% #final fix to the benchmark label so it is positioned correctly
style(textposition = "top")
I wasn’t quite happy with this graph so I made some adjustments:
I renamed the variables of interest so they wouldn’t show in all capitals when hovering over the graph
#rename
df_int <- dplyr::rename(
df_proc, #new name first, old name after = sign
Date = DATE,
Location = NAME,
Rate = DIAG_RATE_65_PLUS)
Then created a graph that would allow highlgihts, which required some changes to the graph structure previously used
#the d object will be useful for highlights based on Location (i.e, previosuly known as NAME variable)
d <- highlight_key(
df_int, ~Location)
#assign graph structure to an object, Not fully commented because previous comments apply to this structure.
gr_int4 <- ggplot(d,
mapping =
aes(x = Date,
y= Rate,
group = Location,
colour = Location)) +
ylim(c(50, 80)) +
scale_x_date(
limits = as.Date(
c("2016-05-31","2021-04-30")),
date_breaks = "6 month",
date_labels = "%b %Y",
expand = c(0.02,0)) +
geom_line(
size = 1) +
geom_hline(
yintercept = 66.7,
linetype = 2,
size = 1,
colour = "black") +
annotate(
"text",
x = as.Date("2016-12-31"),
66.7,
vjust = -0.5,
label = "Benchmark rate",
size = 4) +
scale_color_brewer(
palette = "Spectral") +
theme_bw() +
theme(
panel.grid.minor = element_blank(),
panel.grid.major.y = element_blank(),
plot.margin = margin(1,1,1,1,"mm")) +
labs(x = "Month",
y = "Dementia diagnoses rate",
title = "Dementia diagnosis rates in England")
gg <- ggplotly(gr_int4, tooltip = c("text", "x", "y")) %>%
style(textposition = "top")
highlight(gg,
"plotly_hover",)
## Setting the `off` event (i.e., 'plotly_doubleclick') to match the `on` event (i.e., 'plotly_hover'). You can change this default via the `highlight()` function.
gg %>% layout(hovermode = "x unified")
#the d object will be useful for hilghlights based on Location (i.e, previosuly known as NAME variable)
d <- highlight_key(
df_int, ~Location)
#assign graph structure to an object, Not fully commented because previous comments apply to this structure.
gr_int4 <- ggplot(d,
mapping =
aes(x = Date,
y= Rate,
group = Location,
colour = Location)) +
ylim(c(50, 80)) +
scale_x_date(
limits = as.Date(
c("2016-05-31","2021-04-30")),
date_breaks = "6 month",
date_labels = "%b %Y",
expand = c(0.02,0)) +
geom_line(
size = 1) +
geom_hline(
yintercept = 66.7,
linetype = 2,
size = 1,
colour = "black") +
annotate(
"text",
x = as.Date("2016-12-31"),
66.7,
vjust = -0.5,
label = "Benchmark rate",
size = 4) +
scale_color_brewer(
palette = "Spectral") +
theme_bw() +
theme(
panel.grid.minor = element_blank(),
panel.grid.major.y = element_blank(),
plot.margin = margin(1,1,1,1,"mm")) +
labs(x = "Month",
y = "Dementia diagnoses rate",
title = "Dementia diagnosis rates in England")
gg <- ggplotly(gr_int4, tooltip = c("text", "x", "y")) %>%
style(textposition = "top") %>%
layout(hovermode = "x unified")
highlight(gg,
#"plotly_hover",
defaultValues = "England")
## Setting the `off` event (i.e., 'plotly_doubleclick') to match the `on` event (i.e., 'plotly_click'). You can change this default via the `highlight()` function.
gg %>% layout(hovermode = "x unified")
Almost there
d <- highlight_key(df_int, ~Location)
gr_int4 <- ggplot(d,
aes(Date, Rate, group = Location)) +
geom_line()
gg <- ggplotly(gr_int4, tooltip = c("text", "x", "y"))
highlight(gg, "plotly_hover", defaultValues = "England")
## Setting the `off` event (i.e., 'plotly_doubleclick') to match the `on` event (i.e., 'plotly_hover'). You can change this default via the `highlight()` function.
ggplotly(gr_int3, tooltip = c("text", "x", "y")) %>%
style(textposition = "top") %>%
layout(hovermode = "x unified")
Brief thoughts on what you have learnt, what you might do next if you had more time / more data
attempt 4
library(tidyverse)
library(highcharter)
## Warning: package 'highcharter' was built under R version 4.0.5
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
hchart(df_proc, "line", hcaes(x = DATE, y = DIAG_RATE_65_PLUS, group = NAME)) %>%
# hc_yAxis(reversed = TRUE) %>%
hc_plotOptions(
series = list(
events = list(
mouseOver = JS("function() { if(this.options.color !== 'red') {this.update({color: 'red'})} }"),
mouseOut = JS("function() { if(this.options.color === 'red') {this.update({color: '#ddd'})} }")
),
states = list(
hover = list(
enabled = FALSE,
lineWidth = 5,
colour = "red"
)
)
)) #%>%
#hc_colors("#dbdbdb")
attempt 3
highchart(type = "stock") %>%
hc_add_series(df_proc, "line", hcaes(x = DATE, y = DIAG_RATE_65_PLUS, group = NAME)) %>%
#hc_add_theme(hc_theme_darkunica()) %>%
hc_title(text = "Dementia Diagnoses Rates in England: National and Regional Trends")
attempt 2
hchart(df_proc, "line", hcaes(x = DATE, y = DIAG_RATE_65_PLUS, group = NAME))
Run
Resources:
From static to animated time series: the tidyverse way
https://blog.exploratory.io/filter-data-with-dplyr-76cf5f1a258e
HOW CAN I FORMAT A STRING CONTAINING A DATE INTO R “DATE” OBJECT? | R FAQ
https://stats.idre.ucla.edu/r/faq/how-can-i-format-a-string-containing-a-date-into-r-date-object/
Plotting multiple variables at once using ggplot2 and tidyr
https://scc.ms.unimelb.edu.au/resources-list/simple-r-scripts-for-analysis/r-scripts
changing ggplot2::facet_wrap title from the default https://stackoverflow.com/questions/48860158/changing-ggplot2facet-wrap-title-from-the-default/48860657